A Recoverable Distributed Shared Memory Integrating Coherence and Recoverability
Abstract
Large-scale distributed systems are attractive for executing parallel applications that require very large amounts of computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM). Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach limits hardware development and takes advantage of the data replication provided by a DSM to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on an Intel Paragon with 56 nodes. In particular, it shows the performance degradation introduced by the fault-tolerance mechanisms on failure-free executions, compared to a standard DSM on the same architecture.
Similar resources
UsulDSM: A Page-based Recoverable Distributed Shared Memory Project Report
UsulDSM is a page-based recoverable software distributed shared memory system designed for networks of computers that do not have access to a physically shared memory. In this report we describe the architecture of UsulDSM and discuss its design and implementation. We also evaluate its performance through a simple parallel application that uses UsulDSM. We also analyze UsulDSM's scalability and t...
An Extended Coherence Protocol for Recoverable DSM Systems with Causal Consistency
This paper presents a coherence protocol for recoverable Distributed Shared Memory (DSM) systems with causally consistent read-write objects. It uses independent checkpointing tightly integrated with coherence operations. That integration results in high availability of shared objects and ensures fast restoration of the consistent state of DSM in spite of multiple node failures, introducing lit...
A memory approach to consistent, reliable distributed shared memory
Fault-tolerant distributed shared memory systems do not always need to support a complete and consistent recovery after a failure. We describe a framework within which different approaches to, and different degrees of, consistency and recoverability can be understood. The addition of consistent failure recovery may be approached from two different viewpoints: either by an application-oriented vie...
A Memory Approach to Consistent, Reliable DSM
Fault-tolerant distributed shared memory systems do not always need to support a complete and consistent recovery after a failure. We describe a framework within which different approaches to, and different degrees of, consistency and recoverability can be understood. The addition of consistent failure recovery may be approached from two different viewpoints: either by an application-oriented vi...
Replication of Checkpoints in Recoverable DSM Systems
This paper presents a new technique of recovery for object-based Distributed Shared Memory (DSM) systems. The new technique, integrated with a coherence protocol for atomic consistency model, offers high availability of shared objects in spite of multiple node and communication failures, introducing little overhead. It ensures fast recovery in case of multiple node failures and enables a DSM sy...
Publication date: 1995